AI-ASSISTED DIGITALISATION OF HISTORICAL DOCUMENTS

Abstract

Preserving historical archival heritage involves not only physical measures to safeguard these valuable texts but also providing for their digital preservation. However, merely digitising manuscripts and codexes is not enough. A further step is needed: the digitalisation of the content, i.e. the verbatim transcription of the scanned texts. This process enables accurate preservation of the textual content, making it easier to search for information and to conduct analyses. With the help of artificial intelligence, particularly Deep Neural Networks (DNNs), automatic handwriting recognition can be performed. In this study, we employed a Convolutional Recurrent Neural Network (CRNN), an established type of DNN, to determine the minimum amount of labelled data required to automatically transcribe five different datasets that vary in language and time period. The results show that a Character Error Rate (CER) lower than 10% can be achieved with just a few hundred text lines in almost all cases.
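The abstract reports results as a Character Error Rate but does not say how it is computed. As a minimal sketch, assuming the common definition (total character-level edit distance between reference and predicted lines, divided by the total number of reference characters), it could be calculated as below. The function names `edit_distance` and `character_error_rate` are illustrative and are not taken from the study.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning ref into hyp."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def character_error_rate(references: Sequence[str],
                         hypotheses: Sequence[str]) -> float:
    """CER over a set of text lines, under the assumed definition:
    sum of edit distances / sum of reference lengths."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have equal length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(edit_distance(r, h)
                      for r, h in zip(references, hypotheses))
    return total_edits / total_chars


if __name__ == "__main__":
    # Hypothetical ground-truth lines and model outputs, for illustration only.
    truth = ["anno domini", "in nomine"]
    predicted = ["anno domimi", "in nomine"]
    print(f"CER = {character_error_rate(truth, predicted):.3f}")
```

Under this definition, a CER below 10% means that fewer than one character in ten would need correcting to recover the reference transcription.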

Similar resources

Flexible Computer Assisted Transcription of Historical Documents Through Subword Spotting

In the absence of accurate handwriting recognition for historical documents, computer assisted transcription (CAT) methods move into the spotlight. We explore some of the weaknesses of current CAT systems and propose a CAT system that relies on subword spotting and overcomes most of them. The system is ideal for crowdsourcing transcription to mobile users.

Historical Documents Modernization

Historical documents are mostly accessible to scholars specialized in the period in which the document originated. In order to increase their accessibility to a broader audience and help in the preservation of the cultural heritage, we propose a method to modernize these documents. This method is based on statistical machine translation, and aims at translating historical documents into a mode...

Unsupervised Transcription of Historical Documents

We present a generative probabilistic model, inspired by historical printing processes, for transcribing images of documents from the printing press era. By jointly modeling the text of the document and the noisy (but regular) process of rendering glyphs, our unsupervised system is able to decipher font structure and more accurately transcribe images into text. Overall, our system substantially...

Exploiting Collection Level for Improving Assisted Handwritten Words Transcription of Historical Documents

Transcription of handwritten words in historical documents is still a difficult task. When processing a huge number of pages, document-centered approaches are limited by the trade-off between automatic recognition errors and the tedious aspect of human user annotation work. In this article, we investigate the use of inter-page dependencies to overcome those limitations. For this, we propose a new...

Mining dates from historical documents

The essential quality of information in a digital library is accessibility. Full text search is not enough for some collections; more can be done. Historical collections, for example, contain dates, and it would be useful to historians to be able to search by them. However, these dates occur anywhere within the text of historical documents, and to be searched they must be extracted from the doc...

Journal

Journal title: The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences

Year: 2023

ISSN: 1682-1777, 1682-1750, 2194-9034

DOI: https://doi.org/10.5194/isprs-archives-xlviii-m-2-2023-557-2023